Last week, UNICEF published a report titled "COVID-19 and School Closures: One Year of Education Disruption". It points out that Latin America and the Caribbean is home to 3 out of 5 children who lost an entire school year worldwide. In other words, the region accounts for almost 60 per cent of all children who missed an entire school year due to COVID-19 lockdowns, according to the data released by UNICEF.

So I searched for the original databases and found only three available: 1) Date, ISO, Country and Status; 2) UNICEF Region, Average days closed (weighted by number of students) and Type; 3) Country, Income Group, Days: Academic break, Days: Fully closed, Days: Fully open, Days: Partially closed, Instruction Days and Number of students.

To find a new angle, I began exploring the databases as follows:

import pandas as pd
import numpy as np
pd.set_option("display.max_columns", 200)
pd.set_option("display.max_colwidth", 200)
#Import database
df = pd.read_csv("covid_impact_education.csv")
df.head(15)
Date ISO Country Status Note
0 16/02/2020 ABW Aruba Fully open NaN
1 16/02/2020 AFG Afghanistan Fully open NaN
2 16/02/2020 AGO Angola Fully open NaN
3 16/02/2020 AIA Anguilla Fully open NaN
4 16/02/2020 ALB Albania Fully open NaN
5 16/02/2020 AND Andorra Fully open NaN
6 16/02/2020 ARE United Arab Emirates Fully open NaN
7 16/02/2020 ARG Argentina Fully open NaN
8 16/02/2020 ARM Armenia Fully open NaN
9 16/02/2020 ATG Antigua and Barbuda Fully open NaN
10 16/02/2020 AUS Australia Fully open NaN
11 16/02/2020 AUT Austria Fully open NaN
12 16/02/2020 AZE Azerbaijan Fully open NaN
13 16/02/2020 BDI Burundi Fully open NaN
14 16/02/2020 BEL Belgium Fully open NaN
#Look at the types
df.dtypes
Date        object
ISO         object
Country     object
Status      object
Note       float64
dtype: object
#Convert date to datetime. 
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
df.dtypes
Date       datetime64[ns]
ISO                object
Country            object
Status             object
Note              float64
dtype: object
#Import altair
import altair as alt
from vega_datasets import data
alt.data_transformers.disable_max_rows()
DataTransformerRegistry.enable('default')

I wanted to create an overview graph of how governments began closing schools and moving to distance learning. On March 11th, 2020, when the World Health Organization declared the novel coronavirus (COVID-19) outbreak a global pandemic, schools were closed due to COVID-19 in only 26 countries. By March 31st, almost 170 countries had moved to distance learning. The situation began to change in May, when a decline in the number of countries with full school closures was accompanied by an increase in the number of countries where schools were partially or fully open. In July, more than 100 countries started their academic break. By September, at least 80 had decided to resume in-person classes and only 40 were completely shut down. Today, only 27 are fully closed.

alt.Chart(df).mark_area().encode(
    alt.X('Date'),
    y = alt.Y('count()'),
    color = 'Status',
    tooltip = ('count()', 'Date', 'Status')
).properties(
    width=850,
    height=300
)
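The counts quoted in the overview can be checked without reading them off the chart. This is a minimal sketch on a made-up sample in the same layout as covid_impact_education.csv; the status label "Fully closed" and the sample rows are assumptions, since the preview above only shows "Fully open".

```python
import pandas as pd

# Made-up sample in the layout of covid_impact_education.csv (not real data).
sample = pd.DataFrame({
    "Date": ["16/02/2020", "16/02/2020", "11/03/2020", "11/03/2020", "11/03/2020"],
    "ISO": ["ARG", "AUS", "ARG", "AUS", "ITA"],
    # "Fully closed" is an assumed label; check df["Status"].unique() for the real ones.
    "Status": ["Fully open", "Fully open", "Fully open", "Fully open", "Fully closed"],
})
sample["Date"] = pd.to_datetime(sample["Date"], format="%d/%m/%Y")

# One row per date, one column per status, each cell = number of countries.
counts = sample.groupby(["Date", "Status"]).size().unstack(fill_value=0)

# Look up a single day, e.g. the date of the WHO pandemic declaration.
on_march_11 = counts.loc[pd.Timestamp("2020-03-11")]
print(on_march_11)
```

Running the same groupby on `df` would give the exact number of countries in each status on any date.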
#Make a graph by UNICEF region.
#Import data by region
df3 = pd.read_csv("region.csv")
df3.head(10)
UNICEF Region Average days Type
0 Latin America and Caribbean 158 Fully closed
1 South Asia 146 Fully closed
2 Eastern and Southern Africa 101 Fully closed
3 Middle East and North Africa 90 Fully closed
4 West and Central Africa 77 Fully closed
5 Eastern Europe and Central Asia 59 Fully closed
6 East Asia and Pacific 56 Fully closed
7 Western Europe 52 Fully closed
8 North America 0 Fully closed
9 Global 95 Fully closed

The highest average number of days on which in-person classroom instruction was disrupted is seen in Latin America and the Caribbean, followed by South Asia and then Eastern and Southern Africa. Schools in Latin America and the Caribbean remained fully closed for 158 days between March 2020 and February 2021, well above the global estimate of 95 days. Over the same period, schools in Latin America and the Caribbean stayed fully open for only 6 days and those in South Asia for 7, while the global average is 37 days.

alt.Chart(df3).mark_bar().encode(
    alt.X('Average days'),
    alt.Y('UNICEF Region'),
    alt.Color('Type'),
    tooltip = 'Average days',
    order=alt.Order(
      'Type',
      sort='ascending'
    )
).properties(
    width=700,
    height=300
)
#Import new database
df3 = pd.read_csv("days_students.csv")
df3.head()
ISO3 UNICEF Country UNICEF Region Income Group Days: Academic break Days: Fully closed Days: Fully open Days: Partially closed Instruction Days Pre-primary Primary Lower Secondary Upper Secondary
0 AFG Afghanistan South Asia Low income (L) 32 115 55 33 203 24,220 6,544,906 1,982,869 1,081,020
1 AGO Angola Eastern and Southern Africa Lower middle income (LM) 0 139 9 87 235 784,381 5,620,915 1,525,954 508,196
2 AIA Anguilla Latin America and Caribbean NaN 62 20 93 60 173 434 1,646 637 422
3 ALB Albania Eastern Europe and Central Asia Upper middle income (UM) 77 41 92 25 158 81,026 170,861 148,810 120,062
4 AND Andorra Western Europe High income (H) 50 77 105 3 185 2,204 4,325 2,985 1,528
df3.shape
(200, 13)
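The regional file reports averages "weighted by number of students", and this country-level file should allow something similar to be recomputed. The sketch below uses four rows copied from the previews in this notebook and assumes the weight is total enrolment across the four school levels, which the source does not spell out. The student columns are kept as comma-formatted strings on the assumption that pandas read them as text.

```python
import pandas as pd

# Four rows copied from the days_students.csv previews in this notebook.
rows = pd.DataFrame({
    "ISO3": ["AFG", "BGD", "PAN", "BRA"],
    "UNICEF Region": ["South Asia", "South Asia",
                      "Latin America and Caribbean", "Latin America and Caribbean"],
    "Days: Fully closed": [115, 198, 211, 191],
    "Pre-primary": ["24,220", "3,578,384", "95,481", "5,101,935"],
    "Primary": ["6,544,906", "17,338,100", "418,852", "16,106,812"],
    "Lower Secondary": ["1,982,869", "8,497,398", "200,934", "13,414,172"],
    "Upper Secondary": ["1,081,020", "7,372,422", "121,979", "9,704,007"],
})

# Strip the thousands separators so the counts become integers.
levels = ["Pre-primary", "Primary", "Lower Secondary", "Upper Secondary"]
for col in levels:
    rows[col] = rows[col].str.replace(",", "", regex=False).astype(int)

# Assumed weight: total students across all four levels.
rows["Students"] = rows[levels].sum(axis=1)

# Weighted mean = sum(days * students) / sum(students), per region.
rows["Closed x students"] = rows["Days: Fully closed"] * rows["Students"]
sums = rows.groupby("UNICEF Region")[["Closed x students", "Students"]].sum()
weighted = sums["Closed x students"] / sums["Students"]
print(weighted.round(1))
```

Passing `thousands=","` to `pd.read_csv` is another way to get numeric student columns at load time. With only two countries per region, these figures will not match the published regional averages.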
#This is a COVID-19 dataset that contains the total number of cases and deaths per million. The file name (owid-covid-data.csv) indicates it comes from Our World in Data.
df4 = pd.read_csv("owid-covid-data.csv")
df4.head()
Countries ISO3 Total cases per million Total deaths per million
0 Afghanistan AFG 1435.355 62.962
1 Albania ALB 39467.649 679.686
2 Algeria DZA 2608.421 68.824
3 Andorra AND 143260.208 1449.557
4 Angola AGO 642.239 15.670
df4.shape
(194, 4)
#I wanted to see the relationship between COVID-19 cases and school closures, so I merged the two databases.
#Open question: how can I see which countries do not appear in both databases (200 rows vs. 194)?
new_table = pd.merge(df3, df4, on="ISO3")
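One way to answer the question about the missing countries is an outer merge with `indicator=True`, which adds a `_merge` column saying which side each row came from. This is a sketch on toy stand-ins for `df3` and `df4`; the rows are illustrative and do not claim that these particular countries are the ones missing from the real files.

```python
import pandas as pd

# Toy stand-ins for df3 (school days) and df4 (COVID-19 cases).
schools = pd.DataFrame({
    "ISO3": ["AFG", "AGO", "AIA", "ALB"],
    "UNICEF Country": ["Afghanistan", "Angola", "Anguilla", "Albania"],
})
cases = pd.DataFrame({
    "ISO3": ["AFG", "AGO", "ALB", "DZA"],
    "Countries": ["Afghanistan", "Angola", "Albania", "Algeria"],
})

# An outer merge keeps unmatched rows; indicator=True labels their origin.
merged = pd.merge(schools, cases, on="ISO3", how="outer", indicator=True)

# Rows present only in the school data, and rows present only in the case data.
only_schools = merged.loc[merged["_merge"] == "left_only", ["ISO3", "UNICEF Country"]]
only_cases = merged.loc[merged["_merge"] == "right_only", ["ISO3", "Countries"]]
print(only_schools)
print(only_cases)
```

Because the default merge is an inner join, rows can drop out from either side, so the number of unmatched countries may differ from the simple difference between the two row counts.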
#I created a scatter plot to analyze the relationship between COVID-19 cases and school closures
interval = alt.selection_interval()
chart1 = alt.Chart(new_table).mark_point().encode(
    x = 'Days: Fully closed',
    y = 'Total cases per million',
    color = alt.condition(interval, 'UNICEF Region', alt.value('lightgray')),
    tooltip = 'Countries'
).properties(
    selection = interval,
    width=750,
    height=300
)

chart1
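A scatter plot shows the shape of the relationship; a rank correlation puts a rough number on it. The sketch below uses the four countries that appear in both previews above (AFG, AGO, ALB, AND) and computes a Spearman-style coefficient by ranking both columns and correlating the ranks. Four points say nothing about the full dataset, so this only illustrates the mechanics.

```python
import pandas as pd

# Values copied from the two previews above for countries present in both.
pairs = pd.DataFrame({
    "ISO3": ["AFG", "AGO", "ALB", "AND"],
    "Days: Fully closed": [115, 139, 41, 77],
    "Total cases per million": [1435.355, 642.239, 39467.649, 143260.208],
})

# Spearman correlation = Pearson correlation of the ranks.
ranks = pairs[["Days: Fully closed", "Total cases per million"]].rank()
rho = ranks["Days: Fully closed"].corr(ranks["Total cases per million"])
print(round(rho, 2))
```

The same two lines applied to `new_table` would give the coefficient for all merged countries.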
#I sorted the data by days fully closed because I thought Altair would plot it in that same order, but it didn't.
#The result is stored under a new name so that df4 (the COVID-19 data) is not overwritten.
df_sorted = df3.sort_values(by='Days: Fully closed', ascending=False)
df_sorted.head()
ISO3 UNICEF Country UNICEF Region Income Group Days: Academic break Days: Fully closed Days: Fully open Days: Partially closed Instruction Days Pre-primary Primary Lower Secondary Upper Secondary
138 PAN Panama Latin America and Caribbean High income (H) 23 211 1 0 212 95,481 418,852 200,934 121,979
158 SLV El Salvador Latin America and Caribbean Lower middle income (LM) 30 205 0 0 205 230,010 662,740 308,565 213,011
16 BGD Bangladesh South Asia Lower middle income (LM) 33 198 4 0 202 3,578,384 17,338,100 8,497,398 7,372,422
23 BOL Bolivia (Plurinational State of) Latin America and Caribbean Lower middle income (LM) 40 192 1 2 195 353,898 1,379,099 445,168 788,570
24 BRA Brazil Latin America and Caribbean Upper middle income (UM) 34 191 1 9 201 5,101,935 16,106,812 13,414,172 9,704,007

Of course, this is only a first approach. I still don't have a clear angle, because I also want to make scatter plots comparing the number of days schools were closed with: a) Internet access, b) Income, and c) Number of students (in millions) who have missed class. So there is still a lot I want to do.